Abstract

The Schema Learning System (SLS) automatically assembles
task-specific object recognition programs from existing
IU algorithms. SLS brings together two emerging
technologies - image understanding and machine
learning  -  to  automatically  build  customized  procedures 
for  recognizing  and  extracting  specific  object  classes  in 
constrained contexts. This paper describes the representations
and algorithms underlying SLS, and presents an
example  of  SLS  learning  to  recognize  rooftops  in  aerial 
images  of  Ft.  Hood.  This  task  is  the  first  of  several  tasks 
from  the  ARPA/ORD  sponsored  RADIUS  project  [6] 
that  SLS  is  intended  to  learn  without  human  interaction. 
In  later  experiments,  SLS  will  be  tasked  to  automatically 
construct  3D  models  of  buildings  and  other  objects  of 
interest  from  overlapping  aerial  images. 

1  Introduction 

Although the field of image understanding (IU) has made
significant  advances  over  the  past  twenty  years,  we  have 
not  yet  developed  a  theoretical  or  practical  understanding 
of  how  the  many  components  of  vision  are  combined 
into  coherent,  functioning  systems.  As  a  result,  there 
are  few  applications  of  image  understanding  technology 
in the real world, even though the library of available IU
techniques keeps growing. The problem is the labor and
expertise required to select the right set of IU algorithms
for  a  specific  task,  and  to  combine  them  into  a  single, 
smoothly-functioning  system. 

Much  of  what  makes  the  integration  problem  difficult  is 
that  the  most  effective  combinations  of  algorithms  are 
often  object,  context  or  task  dependent.  Some  objects, 
for  example,  have  distinct  colors  that  can  be  used  to 
focus attention on particular parts of an image, while
others  have  easily  identifiable  substructures,  repetitive 
textures,  or  other  properties  that  help  us  to  recognize 
them and place them in space. Unfortunately, the specific
features and techniques needed vary from object to
object and context to context, so that most visual tasks
require specialized solutions, even within constrained
domains such as aerial image interpretation. This limits
the general use of image understanding technology,
because  successful  vision  systems  must  be  redesigned 
and/or  hand-tuned  for  each  new  application. 

At  the  same  time,  control  engineers  have  long  modeled 
the  control  of  discrete  events  (such  as  algorithms)  as 
Markov Decision Problems (MDPs). Although the traditional
control-theoretic techniques for solving MDPs
(i.e. Dynamic Programming) require a more detailed
process model than is generally available for IU applications,
we believe that recent advances in reinforcement
learning [15, 16, 17, 18] and in function approximation
(including,  but  not  limited  to,  backpropagation  neural 
networks  [13, 11])  make  it  possible  to  learn  near-optimal 
control  policies  for  image  understanding.  In  principle, 
one  should  be  able  to  transform  the  task  of  constructing  a 
new  vision  application  to  one  of  training  the  system  with 
a  set  of  representative  input-output  examples  relevant  for 
the task. Given a library of available IU algorithms and
representations, the goal is to automatically select sequences
of algorithms and intermediate representations
to  optimize  specific  applications  while  minimizing  the 
involvement  of  the  user. 

1.1 The Need for Learning in Complex IU
Applications 

In many ways, the stage has been set for learning control
strategies for image understanding by the research of
the  past  twenty  years.  Computer  vision  researchers  have 
been  dividing  naturally  -  without  any  global  consensus  or 
mandate - into 10 or 20 subfields with small, well-defined
problems.  This  has  led  to  the  development  of  specific 
mathematical  theories  and  algorithmic  techniques  for 
each subdiscipline. There are now several good and improving
algorithms for camera calibration, edge and line
extraction (straight and curved), stereo analysis, tracking,
depth from motion (two-frame and multi-frame),
shape recovery, and 3D pose determination, to name just
a few. Indeed, computer vision researchers have made
more progress than most outside the field (and many inside)
are aware of. This state of affairs is due primarily
to  our  inability  to  easily  produce  highly  visible  results  in 
the  form  of  integrated  task-specific  systems. 

The  need  for  robust  and  flexible  techniques  that  adapt 
to the user without requiring extensive explicit programming
and customization is particularly apparent in image
understanding problems that exploit context and/or models,
both of which can be expected to change over time.
Vision systems that learn and adapt are one of the most
important directions in IU research right now. This reflects
an overall trend - to make intelligent systems that
do  not  need  to  be  fully  and  painfully  programmed.  It 
is  the  only  way  for  us  to  develop  vision  systems  for  the 
military  that  are  robust  and  easy  to  use  in  many  different 
tasks. 

1.2  Learning  Strategies  for  2D  and  3D 
Building  Reconstruction 

The ARPA/ORD RADIUS project is an interesting example
of both the importance of IU technology and the
problems with it. Current military doctrine is to achieve
dominant battlefield awareness by digitizing the battlefield,
which implies that the number of images collected
and interpreted will have to increase by orders of magnitude.
Unfortunately, the number of image analysts
available to interpret this data is expected to remain the
same or even decrease, meaning that each analyst will
have to become far more productive, presumably by automating
or semi-automating portions of their task.

To  this  end,  the  RADIUS  project  has  sought  to  develop 
IU tools to automate analysis tasks, such as (2D) building
detection, (3D) building reconstruction, change detection
and road detection. As part of this program, several
universities have developed new and original algorithms
for achieving all or parts of these tasks. Because of the
practical nature of the RADIUS project’s goals, however,
these universities have had to craft not just isolated algorithms,
but complete, functioning systems. Although
many of the underlying algorithms are generally useful,
changes in the problem statement - such as new image
domains and/or new types of sensors - have often
meant that the overall system had to be retuned, if not
overhauled entirely, in order to work on the new task.

This  is  exactly  the  type  of  problem  that  SLS  is  meant  to 
address,  so  we  have  adopted  the  RADIUS  data  and  task 
statements  as  a  test  domain  for  SLS.  This  paper  presents 
some  early  results  of  a  major  experiment  in  using  SLS 
to  accomplish  RADIUS-project  tasks  such  as  building 
detection  and  reconstruction.  The  first  of  these  tasks 
(described  below)  is  to  recognize  the  image  positions  of 
rooftops  in  aerial  photographs.  Several  other  universities 
have previously addressed this problem [7, 9], developing
hand-crafted  strategies  for  finding  rooftops  by  grouping 
line  segments,  analyzing  shadows,  and  exploiting  other 
2D  image  cues.  Our  goal  for  SLS  is  to  automatically 
learn  an  equivalent  (or  better)  strategy  based  on  the  same 
type  of  information. 

Ideally, SLS’s library should contain the same subroutines
used in other rooftop recognition projects. Unfortunately,
SLS’s procedures must be executable UNIX
modules, while many of the subroutines developed for
RADIUS are Lisp functions embedded in RCDE [10].
Therefore, a small set of RADIUS-style vision routines
(recoded as stand-alone C or C++ modules) has been
used for this experiment, many although not all from
UMass. In this paper, the challenge for SLS is to find a
control policy for applying these algorithms that maximizes
system performance on the (2D) roof recognition
task. 

The  strength  of  SLS  is  not  only  that  it  produces  effective 
control policies, but that it becomes possible to reconfigure
the system for new tasks as they arise. Currently,
SLS has access to only a small set of IU algorithms,
most of which extract 2D information from a single image.
In the next few months, however, we expect more
algorithms to be added to this library, including 3D algorithms
for computing digital elevation maps (DEMs)
from stereo image pairs and algorithms for fitting planes
and other surfaces to the DEM data. As each new algorithm
is added, SLS can learn a new control policy that
takes the best advantage of the new routine in
conjunction with the 2D and 3D algorithms already in
its library. SLS can also learn new control policies to
adapt to changing goals, as we expand the system from
finding flat-roofed buildings to constructing 3D building
models of flat-roofed buildings and eventually to
constructing models of buildings with multi-level roofs,
peaked roofs, curved roofs, and/or large structures (such
as air conditioning or water storage units) on the roof.

2  Markov  Decision  Problems 

Control engineers have long modeled the control of discrete
processes as a Markov Decision Problem (MDP).
Although even a brief tutorial on MDPs is beyond the
scope of this article, MDPs can be pictured in terms
of systems (similar to finite state machines) having a
discrete set of states and making discrete transitions between
states as a result of actions. Initially, the system
starts in state s_0; in response to an action (call it a_0), the
system is advanced to a new state (call it s_1); the system
is also given a reward (or penalty) for making the transition
from state s_0 to state s_1. The next action a_1, applied
at state s_1, advances the system to state s_2 at time t_2 (and
gives it another reward or penalty), and so on until the
system reaches a terminal state. The goal of a Markov
Decision Problem is to select a sequence of states and
actions s_0, a_0, s_1, a_1, ..., s_n, a_n that maximizes the total
reward of reaching a terminal state. (MDPs can also
maximize the total reward as t → ∞, but for the purposes
of this paper we will limit the discussion to tasks
with terminal states.)

Two features of the MDP formalism are particularly important.
One is that the MDP formalism is stochastic:
each action has a probability function associated with
it that gives the probability of transitioning to state s_k
from state s_j (written as P(s_k | s_j, a)). The other is that
the solution to a Markov Decision Problem is a control
policy, often expressed as a table, that maps states onto
actions so as to maximize the expected total reward. (This
is necessary since the outcomes of actions are stochastic.)

Many readers may be familiar with Dynamic Programming,
a set of techniques for computing optimal control
policies for MDPs, given that the transition probabilities
and rewards are known a priori for each action/state
pair. Unfortunately, such process models are often unavailable.
Reinforcement learning is a branch of machine
learning research that seeks to learn optimal control
policies for MDPs without knowing the transition
probabilities in advance, generally by developing empirical
estimates of these probabilities on-line.

Two other elements of the MDP formalism are the V^π(s)
and Q^π(s, a) functions. Intuitively, V^π(s) is the expected
reward from starting in state s and following
control policy π until a terminal state is reached; the
V^π(s) function is also sometimes called the state value
function. Q^π(s, a) is the expected reward from starting
in state s, applying action a, and then following control
policy π thereafter until a terminal state is reached; the Q
function is sometimes called the state/action value function.
The basis of Dynamic Programming algorithms
is that given a process model, they compute V*(s) and/or
Q*(s, a) for every state or state/action pair, where *
denotes the optimal policy. The V*(s) and Q*(s, a) functions
can then be used to generate the optimal policy table by
selecting, for every state s, the action a with the highest
Q*(s, a) value.
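
As an illustration of the table-based case, the following minimal sketch computes Q*(s, a) by repeated Bellman backups and then reads off the greedy policy table. It assumes a known process model for a small episodic, undiscounted MDP that always terminates; the toy states, actions, probabilities and rewards are invented for the example and are not taken from SLS.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical process model: (state, action) -> [(probability, next_state, reward)].
Model = Dict[Tuple[str, str], List[Tuple[float, str, float]]]


def q_star(model: Model, terminals: Set[str], sweeps: int = 50) -> Dict[Tuple[str, str], float]:
    """Approximate Q*(s, a) by repeated Bellman optimality backups."""
    v: Dict[str, float] = {}  # V*(s) estimates; terminal or unseen states count as 0
    q: Dict[Tuple[str, str], float] = {}
    for _ in range(sweeps):
        # Back up every state/action pair from the current state values.
        for (s, a), outcomes in model.items():
            q[(s, a)] = sum(p * (r + v.get(s2, 0.0)) for p, s2, r in outcomes)
        # V*(s) is the value of the best action available in s.
        v = {}
        for (s, _), value in q.items():
            if s not in terminals:
                v[s] = max(v.get(s, float("-inf")), value)
    return q


def greedy_policy(q: Dict[Tuple[str, str], float]) -> Dict[str, str]:
    """Policy table: for every state, the action with the highest Q value."""
    policy: Dict[str, str] = {}
    for (s, a), value in q.items():
        if s not in policy or value > q[(s, policy[s])]:
            policy[s] = a
    return policy


# Toy model (an assumption for illustration only).
toy_model: Model = {
    ("s0", "a"): [(0.5, "s1", 0.0), (0.5, "end", 1.0)],
    ("s0", "b"): [(1.0, "end", 0.6)],
    ("s1", "a"): [(1.0, "end", 2.0)],
    ("s1", "b"): [(1.0, "end", 0.0)],
}
toy_q = q_star(toy_model, terminals={"end"})
toy_policy = greedy_policy(toy_q)
```

In this toy model action b looks better from s0 when only the immediate reward is considered, but the backups reveal that action a has the higher expected total reward.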

3  The  Schema  Learning  System 

The  Schema  Learning  System  (SLS)  uses  reinforcement 
learning and neural networks to automatically assemble
computer vision algorithms into working special-purpose
computer vision systems. It accomplishes this
by casting image understanding as a Markov Decision
Problem (MDP), in which the reward function is a measure
of the accuracy of the final object hypothesis^1. The
control policies learned by SLS reason across multiple
levels of representation, selecting the next action at each
step as a function of the “knowledge state” of the system.
The  overall  goal  of  SLS  is  to  learn  policies  that  produce 
accurate  object  hypotheses,  thereby  maximizing  the  total 
reward. 

3.1  Actions 

The  levels  of  representation  in  SLS  are  a  product  of  the 
visual  procedure  library.  Each  procedure  is  declared  to 
have  an  input  data  type(s)  and  an  output  data  type(s). 
For  example,  an  edge  extraction  procedure  is  applied 
to  images  and  produces  edges,  while  an  edge  grouping 
operator  is  applied  to  edges  and  produces  either  straight 
lines  or  curves.  The  library  therefore  defines  both  the 
visual  procedures  and  the  levels  of  representation  that 
SLS  can  reason  across. 
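
One way to picture such a library is as a list of typed procedure declarations. The sketch below is only an illustration: the procedure names and type labels are hypothetical, and the edge grouping operator is modeled as two entries (one per output type), which is an assumption rather than a description of how SLS declares it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Procedure:
    name: str
    input_type: str   # level of representation the procedure consumes
    output_type: str  # level of representation the procedure produces


# Hypothetical declarations mirroring the examples in the text.
LIBRARY: List[Procedure] = [
    Procedure("EdgeExtraction", "image", "edges"),
    Procedure("GroupStraight", "edges", "lines"),
    Procedure("GroupCurved", "edges", "curves"),
]


def applicable(token_type: str, library: List[Procedure] = LIBRARY) -> List[Procedure]:
    """Procedures that can be applied to a token of the given type."""
    return [p for p in library if p.input_type == token_type]


def levels(library: List[Procedure] = LIBRARY) -> List[str]:
    """Levels of representation implied by the declared input and output types."""
    return sorted({p.input_type for p in library} | {p.output_type for p in library})
```

Under these assumptions the set of representation levels is never listed separately; it falls out of the declarations, which is the sense in which the library defines both the procedures and the levels.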

Not  surprisingly,  the  visual  procedures  in  the  library  are 
the actions of the MDP. The states of the system correspond
to the data items (called tokens) produced by
visual procedures. For example, in the rooftop recognition
scenario, the initial state of the system corresponds
to an image. If the action selected is an edge operator,
then this will produce a set of edges which becomes the
new state of the system. Thus the edge operator action
transitions the system from the initial image state to the
edge state.

^1 In general, the reward functions may trade off accuracy for
efficiency, but we will not consider that here.

3.2  State  Spaces 

Clearly,  not  all  sets  of  edges  are  the  same;  neither  are 
all images, polygons, etc. In order to learn sophisticated
control policies, it is necessary to distinguish good (high
quality) data from bad. Unfortunately, it is not obvious
how to divide any given level of representation into a
discrete set of states a priori (although in the past we have
learned  policies  by  training  decision  trees  to  divide  each 
representation’s  space  of  tokens  into  discrete  states  [5]). 

As an alternative, we note that Markov Decision Problems
can be defined over infinite state spaces. In this
formulation, the control policy becomes a function that
maps points in the (now infinite) state space onto actions,
and the V^π(s) and Q^π(s, a) functions similarly
range over infinite state spaces. In particular, in SLS
we  define  a  state  space  for  every  level  of  representation, 
so  that  an  action  maps  points  in  one  space  to  points  in 
another  space,  depending  on  the  level  of  representation 
of  its  input  and  output.  To  return  to  the  edge  operator  as 
an  example,  it  maps  points  in  image  space  onto  points  in 
edge  space. 

Each  level  of  representation  therefore  has  a  state  space 
associated  with  it.  This  space  is  defined  by  the  set  of 
measurable  features  defined  for  that  representation.  For 
example, in the current experiment images have four
measurable features: average intensity, standard deviation,
edginess and speckle. The image state space is
therefore defined as a four dimensional space with each
feature as one dimension. Any particular image is represented
as a point in this space. As another example,
one of the representations used heavily in the rooftop experiments
is the parallel line pair (i.e. two lines that are
parallel to each other). The parallel line pair space is five
dimensional, corresponding to the five features that we
have defined for parallel line pairs: relative angle, extent
of overlap, average contrast, shadowness (whether
a  shadow  lies  just  to  the  outside  of  either  line),  and  the 
variance  of  the  intensity  (pixel)  values  between  the  lines. 

Mathematically, this is a clean formulation of the problem.
In practice, it only works if we can learn estimates
for the Q*(s, a) or V*(s) functions over these infinite
state spaces from a finite number of samples. Inspired by
Tesauro [16] and Zhang and Dietterich [18], we use backpropagation
neural networks to learn approximations to
the Q*(s, a) function for each action, as described below
in Section 3.4. The control policies learned by SLS are
therefore represented as a set of neural networks, each
approximating the Q function for a state space / action
pair. At run-time, the best action to apply to a state (i.e.
data token) is the action with the highest estimated Q
value given its feature vector.
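
The run-time selection rule can be sketched as follows. The trained networks are replaced here by plain Python callables with made-up coefficients, and the level names and feature values are hypothetical; only the rule itself (evaluate every approximator defined for the token's level of representation and take the maximum) comes from the text.

```python
from typing import Callable, Dict, Sequence, Tuple

# Stand-in for a trained network: feature vector -> estimated Q value.
QNet = Callable[[Sequence[float]], float]


def best_action(level: str,
                features: Sequence[float],
                q_nets: Dict[Tuple[str, str], QNet]) -> Tuple[str, float]:
    """Return the action with the highest estimated Q value for this token."""
    scored = [(net(features), action)
              for (lvl, action), net in q_nets.items() if lvl == level]
    if not scored:
        raise ValueError(f"no procedure accepts tokens of type {level!r}")
    value, action = max(scored)
    return action, value


# Hypothetical approximators keyed by (level of representation, action).
q_nets: Dict[Tuple[str, str], QNet] = {
    ("parallel_pair", "GroupParallel"): lambda f: 0.2 + 0.5 * f[2],
    ("parallel_pair", "Par2Polygon"): lambda f: 0.6 - 0.3 * f[0],
    ("corner", "GroupLJnct"): lambda f: 0.9,
}

# Five made-up features of one parallel line pair: relative angle, overlap,
# average contrast, shadowness, intensity variance.
pair_features = [0.1, 0.8, 0.9, 0.0, 0.2]
chosen_action, chosen_value = best_action("parallel_pair", pair_features, q_nets)
```

Keying the approximators by level and action keeps the networks for different representations separate, since their feature vectors have different dimensions.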

3.3  Backtracking  and  the  State/Action  Queue 

As a first pass, SLS can therefore be visualized as a
system that begins with a data token s_0, typically an
image. SLS evaluates the function Q(s_0, a) for every
action a that can be applied to token (and state) s_0, and
selects the action with the highest estimated total reward.
The action (a_0) is then applied to data s_0, creating a
new data token s_1. SLS then selects the action with
the highest estimate for Q(s_1, a) and applies it, creating
state s_2. This process repeats itself, creating a sequence
of states and actions s_0, a_0, s_1, a_1, ... until a data token
at the target level of representation (for example, 2D
roof hypothesis) is created. Such a token represents a
terminal state for the system.

The  problem  with  this  simplified  version  of  SLS  is  that 
many IU procedures - particularly matching procedures
-  do  not  produce  a  single  data  item  as  output.  One 
of  the  visual  procedures  in  the  library  for  recognizing 
rooftops,  for  example,  is  a  graph  matching  procedure 
that  looks  for  a  rectangle  given  a  set  of  line  segments. 
Depending  on  the  number  of  rectangles  in  the  image, 
the  matching  procedure  may  produce  zero,  one  or  many 
rectangle hypotheses. Consequently, when this action is
applied to a data state s_i, it may produce many (or no)
possible successor states s_j.

SLS therefore maintains a state/action queue. When an
action produces multiple results s_j1, ..., s_jn, SLS evaluates
the Q function for each new state/action pair, and creates
a state/action queue sorted by the estimated Q values. It
then  applies  the  highest  rated  state/action  pair,  producing 
zero,  one  or  more  new  data  states,  which  are  then  added 
to  the  state/action  queue. 
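
The queue mechanism just described can be sketched with a priority queue. This is a minimal illustration under stated assumptions, not the SLS implementation: the callback names are hypothetical, the toy states, actions and Q estimates are invented, and the loop is assumed to stop at the first goal-level token.

```python
import heapq
from itertools import count


def interpret(start, actions_for, apply_action, q_estimate, is_goal, max_steps=1000):
    """Best-first interpretation driven by a state/action queue."""
    tie = count()  # tie-breaker so the heap never has to compare states
    queue = []

    def push(state):
        for action in actions_for(state):
            # heapq is a min-heap, so negate Q to pop the highest estimate first.
            heapq.heappush(queue, (-q_estimate(state, action), next(tie), state, action))

    push(start)
    trace = []
    for _ in range(max_steps):
        if not queue:
            break
        _, _, state, action = heapq.heappop(queue)
        trace.append((state, action))
        # An action may produce zero, one or many successor tokens.
        for successor in apply_action(state, action):
            if is_goal(successor):
                return successor, trace
            push(successor)
    return None, trace


# Toy problem (assumption for illustration): tokens are strings.
Q_TABLE = {
    ("image", "find_corners"): 0.9,
    ("c1", "to_polygon"): 0.8,
    ("c2", "to_polygon"): 0.7,
    ("p1", "to_roof"): 0.3,   # polygon p1 turns out to be poor
    ("p2", "to_roof"): 0.75,
}
SUCCESSORS = {
    ("image", "find_corners"): ["c1", "c2"],
    ("c1", "to_polygon"): ["p1"],
    ("c2", "to_polygon"): ["p2"],
    ("p1", "to_roof"): ["roof:p1"],
    ("p2", "to_roof"): ["roof:p2"],
}

result, trace = interpret(
    "image",
    actions_for=lambda s: [a for (s2, a) in Q_TABLE if s2 == s],
    apply_action=lambda s, a: SUCCESSORS.get((s, a), []),
    q_estimate=lambda s, a: Q_TABLE[(s, a)],
    is_goal=lambda s: s.startswith("roof:"),
)
```

In the toy run the pair built from c1 is expanded first, but the resulting polygon p1 receives a low estimate, so the next pair popped from the queue is the one built from c2 and the p1 branch is never executed.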

The state/action queue^2 does more than just allow SLS
to accommodate visual procedures that return an indeterminate
number of results. It also allows SLS to
backtrack during the interpretation process, if necessary.
When SLS selects an action, it does so because the estimated
reward for that action/state pair is high (meaning
that it expects it to lead to a good goal-level hypothesis).
Sometimes, however, this estimate is erroneous,
and the selected action creates low-quality data (i.e. data
with low estimated V(s) values). In this case, SLS will
not pursue the bad hypothesis. Since new action/state
pairs made from the bad hypothesis will have lower Q
estimates than some of the unexecuted state/action pairs
already on the queue, SLS will backtrack to try one of
these previously unexecuted actions. In essence, unlike
most MDP applications, SLS is maintaining a complete
search tree, and using the Q and V functions as heuristics
to select which nodes to expand, in a manner similar to
A* search.

^2 We refrain from calling it the Q-Queue.

3.4  Computing  Q  and  V 

There are several reinforcement learning algorithms for
estimating Q or V without a priori process models, with
TD(λ) [15] and Q-Learning [17] being the best known.
These algorithms build successively better approximations
of Q and V based on experience gained by running
the system; Tesauro [16] and Zhang and Dietterich [18]
have used these techniques with neural network function
approximators to learn Q values for systems with
continuous state spaces.

One  problem  with  these  techniques  is  that  they  have  to 
execute  tens  of  thousands  of  actions  to  converge  to  good 
Q  and  V  estimates  -  even  more  when  continuous  state 
spaces  are  used.  In  an  application  domain  in  which  each 
action  may  take  a  minute  or  more  to  execute,  we  do  not 
have  enough  time  to  directly  apply  these  techniques. 

Fortunately,  very  few  sequences  of  visual  procedures  are 
very long; in general, it only takes between five and ten
IU procedures to get from an image to an object hypothesis.
As a result, although the search trees associated
with  object  recognition  problems  have  high  branching 
factors,  they  are  typically  not  very  deep.  It  is  possible  to 
do  an  exhaustive  search  of  a  limited  number  of  training 
images  and  use  this  data  to  estimate  Q  and  V.  (In  the 
rooftop  recognition  experiment,  the  exhaustive  search 
took  about  8  hours  per  training  image;  less  when  the 
known  positions  of  rooftops  in  the  training  images  were 
used  to  constrain  the  set  of  hypotheses). 

In particular, for every data instance created from a training
image during exhaustive search, we can compute the
reward of the best goal-level token that can be created
from it. These reward values are samples of a function
Q^opt(s, a), which represents the best reward that can be
computed starting from a state/action pair. Q^opt(s, a)
should not be confused with Q*(s, a), the estimated
reward for the optimal control policy. It may not be
possible to select the best action at every state because
of ambiguity in the feature vectors, so Q^opt(s, a) is an
optimistic estimate of Q*(s, a) (hence the term Q^opt),
with the property that Q^opt(s, a) ≥ Q*(s, a) for all s, a. In
fact, as the data features become more discriminating,
Q^opt(s, a) approaches Q*(s, a) from above. Most importantly,
Q^opt(s, a) can be computed from far fewer
training samples than Q*(s, a), because each Q^opt(s, a)
is independent of every other, so the neural nets can be
trained separately.
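
The computation of Q^opt samples from an exhaustively searched tree can be sketched as a single recursive pass. The tree encoding, the token names and the reward values below are assumptions made for illustration, as is the convention that rewards are non-negative and a branch that never reaches a goal-level token is worth 0.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding of an exhaustive search tree:
# state -> {action: [successor states]}.
Tree = Dict[str, Dict[str, List[str]]]


def q_opt_samples(tree: Tree, rewards: Dict[str, float], root: str) -> Dict[Tuple[str, str], float]:
    """Best achievable goal reward below every state/action pair in the tree."""
    samples: Dict[Tuple[str, str], float] = {}

    def best_from(state: str) -> float:
        if state in rewards:  # goal-level token: its reward is known
            return rewards[state]
        best = 0.0            # assumed value of a dead end
        for action, successors in tree.get(state, {}).items():
            value = max((best_from(s) for s in successors), default=0.0)
            samples[(state, action)] = value
            best = max(best, value)
        return best

    best_from(root)
    return samples


# Toy tree and rewards (invented for the example).
toy_tree: Tree = {
    "image": {"find_corners": ["c1", "c2"]},
    "c1": {"to_polygon": ["p1"]},
    "c2": {"to_polygon": ["p2"]},
    "p1": {"to_roof": ["roof:p1"]},
    "p2": {"to_roof": ["roof:p2"]},
}
toy_rewards = {"roof:p1": 0.3, "roof:p2": 0.75}
toy_samples = q_opt_samples(toy_tree, toy_rewards, "image")
```

Each sample would then be paired with the feature vector of its state to form a regression target for the network of the corresponding action. The value for ("image", "find_corners") is the reward of the best roof reachable from any corner, which is what makes the estimate optimistic.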

3.5  Properties  of  SLS 

Readers  who  are  familiar  with  MDPs  may  want  a  clari¬ 
fication  of  certain  technical  points: 

•  Markov State Properties. The theory behind
MDPs, Dynamic Programming and Reinforcement
Learning relies on the so-called "Markov
Assumption" that the transition probabilities for a
state/action pair depend only on the state (and the
action), and not on any previous states. Formally,
it assumes that:

P(s_k | s_j, a_j) = P(s_k | s_1, ..., s_j, a_j)

Clearly, this is only approximately true for the states
in SLS, as represented by the feature vectors associated
with data instances. (Strictly speaking, SLS’s
states should be called situations.) Consequently,
the formal proofs that TD(λ) and Q-Learning converge
to the optimal control policy do not apply.

•  Optimality of Q^opt. Even if SLS had true states
(instead of situations), a greedy policy with respect
to Q^opt would not be optimal. It is easy to construct
artificial examples in which Q^opt(s, a) > Q*(s, a)
for some state/action pairs and not for others, leading
to inefficient behavior. However, because Q^opt
is an optimistic estimate and SLS uses a state/action
queue, applying Q^opt(s, a) will always produce the
optimal answer, even if it sometimes executes unnecessary
actions along the way (assuming true states
and perfect function approximators, and therefore
an accurate estimate of Q^opt(s, a)).


Representation        Features
--------------------  --------------------------------------------
Image                 Avg. Intensity, SD, Edginess, Speckle
LinePairSet           Count, Avg Contrast
Parallel Line Pair    angle, overlap, shadowness, surface fit,
                      distance
Corner (L-Junction)   angle, gap, shadowness, surface fit, scale
Line Group            Count, Parallel Count, Corner Count
Polygon               Edge Cover %, Worst Edge Cover,
                      Avg Perimeter Contrast, Worst Side Contrast,
                      Plane fit error (intensity data), scale,
                      shadowness

Table 1: Features defined for each level of representation in
the visual procedure library for recognizing rooftops.


4  Experiment  -  Detecting  Rooftops 

As  has  already  been  discussed  extensively,  we  assigned 
SLS  the  task  of  finding  rooftops  in  aerial  images  of 
Ft.  Hood,  a  task  that  was  copied  from  the  RADIUS 
project  task  domain.  On  each  trial,  the  system  is  given 
a  subimage  containing  one  or  sometimes  two  buildings, 
and a set of 3D line segments computed as described
in [4]. SLS is also given a visual procedure library that
defines eight levels of representation and nine visual
procedures. The levels of representation correspond to
images, sets of 3D line segments, parallel line pairs,
corners (i.e. vertices of orthogonal lines), line groups
and polygons. Because much of the power of SLS lies
in its ability to distinguish good data from bad based on
feature measurements, Table 1 gives the set of features
defined for each level of representation.

4.1  The  Visual  Procedure  Library 

The visual procedures employed are meant to approximate
some of the techniques being used by researchers in
the RADIUS project. The 3D line segments were computed
off-line as described in [4], and filtered according
to the known height of the ground plane. Eight other visual
procedures are available. The rectilinear line grouping
(RLGS) procedure is an updated version of [12] that
computes relationships between nearby line segments; it
was updated to use information about the camera pose
(available for all RADIUS images) to remove the effects
of perspective distortion from orthogonal and parallel relations.
The SelectParallel and SelectCorner procedures
select parallel line pairs and corners out of the relations
computed by RLGS.

The grouping procedures (GroupParallel and GroupLJnct)
take a pair of parallel lines (or a corner) and
form a group out of all the lines that share a significant
relation to one of the lines in the original pair. This
results in a small group of line segments in which the
Graph Matching procedure can search for a rectangle of
lines. Alternatively, given a pair of parallel or orthogonal
line segments, the Par2Polygon and Corner2Polygon
algorithms go back to the original image data and apply
edge extraction and edge linking operators top down, in
order to look for evidence of additional lines that might
complete the rectangle. Finally, the Polygon2Roof procedure
serves no purpose other than to give SLS a way
to designate a particular polygon as a roof.

At first glance, the visual procedure library would appear
to have only four paths to the goal, which would
make SLS’s task fairly easy. The procedure for selecting
corners, however, typically finds on the order of fifty to
one hundred corners per image, while the procedure for
finding parallel line pairs typically finds twice that many.
As a result, there are approximately five hundred polygons
that SLS might detect in most images, and most of
the work that SLS does is in selecting which hypotheses
- in terms of which corners, parallel pairs, and polygons -
to pursue.

4.2  Detecting  Rooftops 

SLS  was  tested  on  a  set  of  ten  image  fragments  taken 
from  the  RADIUS  project’s  images  of  Ft.  Hood.  The 
training and evaluation were carried out using the ground
truth data developed for the (self) evaluation of UMass’s
hand-crafted building detector and 3D reconstruction
system [4].

SLS was tested using a “leave one out” methodology.
On each trial, SLS was trained on data from nine
images, and the control policy it learned was then
applied to the tenth image. This was repeated ten times,
each time holding a different image out of the training
set for testing. Figure 1 shows one of the rooftops
extracted by SLS. Over ten trials, SLS found a polygon
that corresponded to a true roof surface nine times; in the
tenth trial, SLS confused part of the roof boundary with
a shadow line that was near to (and parallel with) the true
edge of the roof, creating an incorrect roof hypothesis.
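
The leave-one-out protocol described above can be sketched as
follows. This is only an illustration, not SLS code: the
callables `train` and `evaluate` are hypothetical stand-ins for
SLS's policy learning and for the comparison of a roof
hypothesis against ground truth.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")
Policy = TypeVar("Policy")


def leave_one_out(
    images: Sequence[Image],
    train: Callable[[List[Image]], Policy],     # hypothetical: learn a control policy
    evaluate: Callable[[Policy, Image], bool],  # hypothetical: True if the roof found is correct
) -> List[bool]:
    """Hold each image out once, train on the rest, test on the held-out image."""
    outcomes = []
    for i, held_out in enumerate(images):
        # Training set is every image except the one held out for this trial.
        training_set = [img for j, img in enumerate(images) if j != i]
        policy = train(training_set)
        outcomes.append(evaluate(policy, held_out))
    return outcomes
```

With ten images this yields ten trials, each trained on nine
images and tested on the remaining one.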


Figure 1: One of the ten aerial images of buildings at Ft.
Hood, and the roof hypothesis (shown in white) SLS learned
to find in it.


The control policies learned by SLS did not always take
a straight path to the solution. Although they always
preferred finding corners to parallel lines, they would
often select one corner as being the most promising,
use it to generate a polygon hypothesis, and then decide
that the polygon was not as good as expected.
(In other words, the V value of the hypothesis was
lower than the Q value of the action that created it.)
The system would then backtrack, find the next most
promising corner, and try again. In general, the system
created ten to twenty polygon hypotheses (out of several
hundred possible ones) before finding the polygon it
finally selected as the rooftop.
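
The backtracking behavior described above might be sketched as
a best-first loop. This is a simplified illustration under
stated assumptions, not the SLS control algorithm: the callables
`q_value`, `make_polygon` and `v_value` are hypothetical, and the
stopping rule (stop once no untried corner promises more than
the best hypothesis in hand) is an assumption made for the
sketch.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

Corner = TypeVar("Corner")
Polygon = TypeVar("Polygon")


def select_roof(
    corners: Iterable[Corner],
    q_value: Callable[[Corner], float],         # hypothetical: predicted payoff of expanding a corner
    make_polygon: Callable[[Corner], Polygon],  # hypothetical: generate a polygon hypothesis
    v_value: Callable[[Polygon], float],        # hypothetical: estimated value of a hypothesis
) -> Tuple[Optional[Polygon], int]:
    """Return the chosen polygon and the number of hypotheses generated."""
    # Most promising corners first.
    ranked = sorted(corners, key=q_value, reverse=True)
    best: Optional[Polygon] = None
    best_v = float("-inf")
    tried = 0
    for corner in ranked:
        # Assumed stopping rule: no untried corner promises more than
        # the best hypothesis already generated.
        if q_value(corner) <= best_v:
            break
        polygon = make_polygon(corner)
        tried += 1
        v = v_value(polygon)
        if v > best_v:
            best, best_v = polygon, v
        # When v < q_value(corner) the hypothesis was a disappointment;
        # continuing the loop is the "backtrack and try the next corner" step.
    return best, tried
```

In this sketch a hypothesis whose V value falls short of the Q
value that motivated it does not end the search; the loop simply
moves on to the next most promising corner.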

Significantly, SLS can adapt quickly to the introduction
of new procedures or features. The first time we tested
the system on the Ft. Hood data, it succeeded in only
about half the trials. Looking at its mistakes, we realized
it was often confusing shadows with the edges of the
buildings that cast them. We then added a shadowness
feature to the parallel pair, corner and polygon
representations, and SLS’s performance immediately improved.
In general, we suspect that this is how SLS will be used:
as a system to which people add knowledge until it
is able to accomplish the assigned task. Ironically, it
could therefore be viewed as a very good knowledge
engineering tool.

5  Future  Experimental  Plans:  3D  Building 
Reconstruction 

The goals of the RADIUS project go beyond simply
recognizing objects in aerial images and determining their
image position. One of the goals of RADIUS is to
provide the image analyst with tools that automatically
construct 3D models of buildings from overlapping aerial
views. Although more thorough testing of SLS on the
2D recognition task is also planned, the primary emphasis
in the future will be on getting SLS to learn control
policies for 3D building reconstruction.

Although there are clues to 3D structure in monocular
images that SLS is not currently taking advantage of
(such as the sun angle and length of shadows), we
believe that what will make 3D building reconstruction
far more effective is the depth information provided by
overlapping aerial views. The UMass terrain reconstruction
system [14] constructs dense digital elevation maps
(DEMs) accurate to within a meter from pairs of images,
even when those images were taken with wide baselines
at highly oblique angles. This type of dense, 3D data, in
combination with the 3D lines computed in [4], should
make it possible to reconstruct highly accurate building
models. These procedures, along with additional
procedures for fitting planes and complex surfaces to DEM
data, should give SLS many alternative strategies for
constructing 3D building models.

SLS’s task will be to combine the new 3D procedures
with the previous 2D set to form accurate and efficient
control policies. Although the 3D reconstruction results
are not available at the time of printing, we invite readers
to visit our web site to see how the experiments are
going:

http://vis-www.cs.umass.edu/projects/SLS3D.html

6  Conclusions 

Over the last twenty years, image understanding
researchers have developed many effective algorithms for
solving visual subproblems, such as edge and line
extraction, stereo analysis, tracking and pose determination.
Unfortunately, we have not developed a comprehensive
theory of how these algorithms might be put together
into functioning systems, with the result that many
advances in IU have yet to find their way into military or
civilian applications.

The Schema Learning project is an attempt to understand
both the science and the practice of combining IU
algorithms into special-purpose vision systems, by casting
the control of image understanding as a Markov Decision
Problem. At an abstract level, this requires defining
state spaces for IU and algorithms for calculating the
Q and V functions; more practically, it requires building
a system capable of integrating many different IU
algorithms.

The Schema Learning System (SLS) is a first pass at
such an algorithm and system. This paper presents the
results of an early experiment in which SLS was able
to learn control policies that successfully found rooftops
in aerial images in nine out of ten trials. The near-term
future goal, however, is to learn control policies for
automatically constructing 3D building models.